Abstract

We present a parallel version of the well-known Split-Step Fourier method (SSF)
for solving the Nonlinear Schrödinger equation, a mathematical model describing
wave packet propagation in fiber optic lines. The algorithm is implemented under
both distributed and shared memory programming paradigms on the Silicon
Graphics/Cray Research Origin 200. The 1D Fast Fourier Transform (FFT) is
parallelized by writing the 1D array as a 2D matrix and performing independent 1D
sequential FFTs on the rows and columns of this matrix. We can attain almost
perfect speedup in SSF for small numbers of processors, depending on both problem
size and communication contention. The parallel algorithm is applicable to other
computational problems constrained by the speed of the 1D FFT.
1 Introduction



The Nonlinear Schrödinger equation (NLSE)

    $i A_t + \sigma d\, A_{xx} + A^{*} A^{2} = G$,    (1)

is a nonlinear partial differential equation that describes wave packet propagation
in a medium with cubic nonlinearity. Technologically, the most important application
of NLSE is in the field of nonlinear fiber optics |l|, §]. The parameter $\sigma$ specifies the
fiber anomalous group velocity dispersion ($\sigma = 1$) or normal group velocity dispersion
($\sigma = -1$), while the parameter $d$ defines the normalized absolute value of the fiber's
dispersion. The perturbation $G$ is specified by details of the physical fiber being studied.

In the special case $G = 0$, NLSE is integrable 0] and can be solved analytically. In
general, if $G \neq 0$, NLSE must be solved numerically. One of the most popular numerical
methods to solve the perturbed NLSE is the Split-Step Fourier method (SSF) 0. For
small-scale calculations, serial implementations of SSF are adequate; however, as one
includes more physics in the simulation, the need for large numbers of Fourier modes to
accurately solve the NLSE demands parallel implementations of SSF.

Many fiber optics problems demand large-scale numerical simulations based on the
SSF method. One class of such problems involves accurate modeling of wavelength-division
multiplexed (WDM) transmission systems, where many optical channels operate at
their own frequencies and share the same optical fiber. WDM is technologically important
as it is one of the most effective ways to increase the transmission capacity of optical
lines 1^, 1^. To accurately model WDM one needs to include large numbers of Fourier
harmonics in the numerical simulation to cover the entire transmission frequency band.
Moreover, in WDM systems different channel pulses propagate at different velocities and,
as a result, collide with each other. At the pulse collision, Stokes and anti-Stokes sidebands
are generated; these high frequency perturbations lead to signal deterioration 0.
Another fundamental nonlinear effect called four-wave mixing (FWM) [0] must be accurately
simulated, as the FWM components broaden the frequency domain, which requires
even larger numbers of Fourier modes for accurate numerical simulation.

To suppress the FWM ^ and make possible the practical realization of WDM, 
one can use dispersion management (concatenation of fiber links with variable dispersion 
characteristics). The dispersion coefficient d in NLSE is now no longer constant but 
represents a rapidly varying piecewise constant function of the distance down the fiber. 
As a result, one must take a small step size along the fiber to resolve dispersion variations 
and the corresponding pulse dynamics. A final reason to include a large number of Fourier 
modes in numerical simulations is to model the propagation of pseudorandom data streams 
over large distances. 

All of the above factors make simulation of NLSE quite CPU intensive. Serial versions
of the split-step Fourier method in the above cases may be too slow even on the fastest
modern workstations. To address the issue of accurately simulating physical
telecommunication fibers in a reasonable amount of time, we discuss the parallelization of the SSF
algorithm for solving NLSE. Our parallel SSF algorithm is broadly applicable to many
systems and not limited to the solution of NLSE. We consider an algorithm appropriate
for multiprocessor workstations. Parallel computing on multiprocessor systems raises
complex issues including solving problems efficiently with small numbers of processors,
limitations due to the increasingly complex memory hierarchy, and the communication
characteristics of shared and distributed multiprocessor systems.

Current multiprocessors have evolved towards a generic parallel machine, which shares 
characteristics of both shared and distributed memory computers. Therefore most com- 
mercial multiprocessors support both shared memory and distributed memory program- 
ming paradigms. The shared memory paradigm consists of all processors being able to 
access some amount of shared data during the program execution. This addressing of 
memory on different nodes in shared memory multiprocessors causes complications in 
writing efficient code. Some of the most destructive complications are: cache hierarchy 
inefficiency (alignment and data locality), false sharing of data contained in a cache block, 
and cache thrashing due to true sharing of data. Most vendors provide compiler directives 
to share data and divide up computation (typically in the form of loop parallelism) which 
in conjunction with synchronization directives can be used to speed up many sequential 
codes. In distributed memory programming, each processor works on a piece of the com- 
putation independently and must communicate the results of the computation to the other 
processors. This communication must be written explicitly into the parallel code, thus 
requiring more costly development and debugging time. The communication is typically 
handled by libraries such as the message passing interface (MPI) which communicates 
data through Ethernet channels or through the existing memory system. Our primary 
goal is to present a parallel split-step Fourier algorithm and implement it under these 
two different parallel programming paradigms on a 4-processor Silicon Graphics/Cray 
Research Origin 200 multiprocessor computer. 

The remainder of this paper is organized as follows. In Section 2, we recall a few basics
of the split-step Fourier method. In Section 3, we introduce the parallel algorithm for
SSF. Timing results and conclusions are given in Sections 4 and 5, respectively.

2 Split-Step Fourier Method 

The Split-Step Fourier (SSF) method is commonly used to integrate many types of nonlinear
partial differential equations. In simulating Nonlinear Schrödinger systems (NLS),
SSF is predominantly used, rather than finite differences, as SSF is often more efficient.
We remind the reader of the general structure of the numerical algorithm 0.
NLS can be written in the form

    $\frac{\partial A}{\partial t} = (L + N) A$,

where $L$ and $N$ are the linear and nonlinear parts of the equation. The solution over a short
time interval $\tau$ can be written in the form

    $A(t + \tau, x) \approx \exp(\tau L) \exp(\tau N) A(t, x)$,

where the linear operator in NLS acting on a spatial field $B(t, x)$ is written in Fourier
space as

    $\exp(\tau L) B(t, x) = F^{-1} \exp(-i k^{2} \tau) F B(t, x)$,

where $F$ denotes the Fourier transform (FT), $F^{-1}$ denotes the inverse Fourier transform, and
$k$ is the spatial frequency.

We split the computation of $A$ over the time interval $\tau$ into 4 steps:

Step 1. Nonlinear step: Compute $A_1 = \exp(\tau N) A(t, x)$ (by finite differences);
Step 2. Forward FT: Perform the forward FFT on $A_1$: $A_2 = F A_1$;
Step 3. Linear step: Compute $A_3 = \exp(\tau L) A_2$;
Step 4. Backward FT: Perform the backward FFT on $A_3$: $A(t + \tau) = F^{-1} A_3$.

To discretize the numerical approximation of the above algorithm, the potential $A$ is
discretized in the form $A_l = A(lh)$, $l = 0, \ldots, N - 1$, where $h$ is the space step and $N$ is
the total number of spatial mesh points.

The above algorithm of the Split-Step Fourier (SSF) method is the same for both se- 
quential and parallel code. Parallel implementation of this algorithm involves parallelizing 
each of the above four steps. 
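The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: it takes the unperturbed case $G = 0$ with $\sigma d = 1$, so that the nonlinear operator acts as multiplication by $i|A|^2$ and is applied as an exact pointwise exponential rather than by finite differences; the helper names `dft` and `ssf_step`, the periodic grid, and the naive $O(n^2)$ transform (standing in for a vendor FFT) are all hypothetical choices made for the example.

```python
import cmath
import math

def dft(x, sign=-1):
    """Naive O(n^2) discrete Fourier transform; sign=+1 is the unnormalized inverse."""
    n = len(x)
    return [sum(x[m] * cmath.exp(sign * 2j * math.pi * m * k / n) for m in range(n))
            for k in range(n)]

def ssf_step(A, tau, h):
    """Advance the sampled field A by one step tau on a periodic grid of spacing h."""
    n = len(A)
    # Step 1: nonlinear step, A1 = exp(i |A|^2 tau) A, applied point by point.
    A1 = [a * cmath.exp(1j * abs(a) ** 2 * tau) for a in A]
    # Step 2: forward Fourier transform.
    A2 = dft(A1, sign=-1)
    # Step 3: linear step, multiply each mode by exp(-i k^2 tau).
    length = n * h
    A3 = []
    for m in range(n):
        signed_m = m if m < n // 2 else m - n      # signed mode number
        k = 2.0 * math.pi * signed_m / length      # spatial frequency of mode m
        A3.append(A2[m] * cmath.exp(-1j * k * k * tau))
    # Step 4: backward Fourier transform, including the 1/n normalization.
    return [v / n for v in dft(A3, sign=+1)]

# Example: one step for a sech-shaped pulse sampled on 32 points.
n, h = 32, 0.5
A0 = [complex(1.0 / math.cosh((l - n // 2) * h)) for l in range(n)]
A_next = ssf_step(A0, tau=0.01, h=h)
```

Both the nonlinear and the linear factor have unit modulus, so the discrete norm $\sum_l |A_l|^2$ is unchanged by a step; this gives a quick sanity check for any implementation.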

3 The Parallel Version of the Split-Step Fourier (SSF) Method



By distributing computational work between several processors, one can often speed up 
many types of numerical simulations. A major prerequisite in parallel numerical algo- 
rithms is that sufficient independent computation be identified for each processor and 
that only small amounts of data are communicated between periods of independent com- 
putation. This can often be done trivially through loop-level parallelism (shared memory 
implementations) or non-trivially by storing true independent data in each processor's 
local memory. For example, the nonlinear transformation in the SSF algorithm involves
the independent computation over subarrays of spatial elements of $A(l)$. Therefore $P$
processors each will work on subarrays of the field $A$, e.g., the first processor updates $A_0$
to $A_{N/P-1}$, the second processor updates $A_{N/P}$ to $A_{2N/P-1}$, etc.
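The block partition described above can be written down directly. The following is a small sketch, assuming that $P$ divides $N$ and that processors are numbered from zero; the function name `block_range` is hypothetical.

```python
def block_range(p, P, N):
    """Return (first, last) indices of the subarray updated by processor p (0-based)."""
    if N % P != 0:
        raise ValueError("this sketch assumes that P divides N")
    size = N // P
    return p * size, (p + 1) * size - 1

# Example: N = 4096 mesh points shared among P = 4 processors.
ranges = [block_range(p, 4, 4096) for p in range(4)]
```

Because the nonlinear step touches each $A_l$ independently, no data needs to be exchanged between these blocks during that step.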

In the 1D FFT, elements of $(FA)_k$ cannot be computed in a straightforward parallel
manner, because all elements $A_l$ are used to construct any element of $(FA)_k$. The problem
of 1D FFT parallelization has been of great interest for vector [0, |Tl| and distributed
memory computers |T^. These algorithms are highly architecture dependent, involving
efficient methods to do the data rearrangement and transposition phases of the 1D FFT.
Communication issues are paramount in 1D FFT parallelization, and past algorithms have
exploited classic butterfly communication patterns to lessen communication costs [12].
However, due to a rapid change in parallel architectures towards multiprocessor systems
with highly complex memory hierarchies and communication characteristics, these
algorithms are not directly applicable to many current multiprocessor systems. Shared
memory multiprocessors often have efficient communication speeds, and we therefore
implement the parallel 1D FFT by writing $A_l$ as a two-dimensional array, in which we can
identify independent serial 1D FFTs of rows and columns of this matrix. The rows and
columns of the matrix $A$ can be distributed to divide up the computation among several
processors. Due to efficient communication speeds, independent computation stages, and
the lack of the transposition stage of the 1D FFT in SSF computations, we show that
this method exploits enough independent computation to result in a significant speedup
using a small number of processors.

3.1 Algorithm of Parallel SSF 

The difficulty in parallelizing the split-step Fourier algorithm is in Steps 2 and 4, as the
other two steps can be trivially evolved due to the natural independence of the data $A$
and $A_2$. In Step 2 and Step 4 there are non-trivial data dependences over the entire
range $0 \le l \le N - 1$ of $A_1(l)$ and $A_3(l)$ which involve forward and backward Fourier
Transforms (FFT and BFT). The discrete forward Fast Fourier Transform (FFT) is of
the form

    $F(k) = \sum_{l=0}^{N-1} A(l) \exp\left( -\frac{2\pi i}{N}\, k l \right)$,

which requires all elements of $A(l)$. Several researchers have suggested parallel 1D Fast
Fourier Transform algorithms [llO| , |lT], 0], but to date there exist no vendor-optimized
parallel 1D FFT algorithms. Therefore implementations of these algorithms are highly
architecture dependent. Parallel 1D FFT algorithms must deal with serious memory
hierarchy and communication issues in order to achieve good speedup. This may be the
reason why vendors have been slow to support the computational community with fast
parallel 1D FFT algorithms. However, we show below that we can get significant parallel
speedup due to the elimination of the transposition stage in the 1D FFT for SSF methods
and due to exploitation of independent computation by performing many sequential 1D
FFTs on small subarrays of $A(l)$.

Our method of parallelizing the SSF algorithm requires dividing the 1D array $A(l)$
into subarrays which are processed using vendor-optimized sequential 1D FFT routines.
We assume that the dimension $N$ of the 1D array $A$ is the product of two integers,
$N = M_0 \times M_1$. Therefore $A$ can be written as a 2D matrix of size $M_0 \times M_1$. As a result,
we can reduce the expression for the Fourier transform of the array $A$ to the form

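    $F(M_1 k_1 + k_0) = \sum_{n_0=0}^{M_0-1} \sum_{n_1=0}^{M_1-1} A(M_0 n_1 + n_0) \exp\left( -\frac{2\pi i}{N} (M_0 n_1 + n_0)(M_1 k_1 + k_0) \right)$

    $\qquad = \sum_{n_0=0}^{M_0-1} f(k_0, n_0) \exp\left( -\frac{2\pi i}{N} n_0 k_0 \right) \exp\left( -\frac{2\pi i}{M_0} n_0 k_1 \right)$,    (2)

where $F$ is the Fourier transform of $A$ and $f$ is the result of the $M_1$-size Fourier transform
of $A(M_0 n_1 + n_0)$ with fixed $n_0$,

    $f(k_0, n_0) = \sum_{n_1=0}^{M_1-1} A(M_0 n_1 + n_0) \exp\left( -\frac{2\pi i}{M_1} n_1 k_0 \right)$,    (3)

    $n_0, k_1 = 0, \ldots, M_0 - 1$;   $n_1, k_0 = 0, \ldots, M_1 - 1$.    (4)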

The reduced expression Eq. (2) demonstrates that the $N = M_0 \times M_1$ Fourier transform $F$
is obtained by making $M_0$-size Fourier transforms of $f(k_0, n_0) \exp\left( -\frac{2\pi i}{N} n_0 k_0 \right)$ for fixed $k_0$.
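The passage from the double sum to the reduced form in Eq. (2) rests on expanding the product of indices in the exponent. A short sketch of that expansion, using only the index ranges of Eq. (4) and $N = M_0 M_1$:

```latex
\begin{align*}
\frac{(M_0 n_1 + n_0)(M_1 k_1 + k_0)}{N}
  &= \frac{M_0 M_1\, n_1 k_1 + M_0\, n_1 k_0 + M_1\, n_0 k_1 + n_0 k_0}{M_0 M_1} \\
  &= n_1 k_1 + \frac{n_1 k_0}{M_1} + \frac{n_0 k_1}{M_0} + \frac{n_0 k_0}{N}.
\end{align*}
% n_1 k_1 is an integer, so \exp(-2\pi i\, n_1 k_1) = 1 and that term drops out:
\begin{equation*}
\exp\!\left(-\frac{2\pi i}{N}(M_0 n_1 + n_0)(M_1 k_1 + k_0)\right)
  = \exp\!\left(-\frac{2\pi i}{M_1}\, n_1 k_0\right)
    \exp\!\left(-\frac{2\pi i}{N}\, n_0 k_0\right)
    \exp\!\left(-\frac{2\pi i}{M_0}\, n_0 k_1\right).
\end{equation*}
```

Only the first factor depends on $n_1$, so carrying out the sum over $n_1$ yields $f(k_0, n_0)$ of Eq. (3), and the remaining two factors stay under the sum over $n_0$.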

Therefore the 1D array $A$ is written as a 2D matrix $a_{jk}$ of size $M_0 \times M_1$ with elements
$(A(0), \ldots, A(M_0 - 1))$ in the first column, $(A(M_0), \ldots, A(2M_0 - 1))$ in the second column,
etc. We use this matrix $a_{jk}$ in our parallel FFT algorithm:

Step 1. Independent $M_1$-size FFTs on rows of $a_{jk}$.
Step 2. Multiply elements $a(j, k)$ by a factor $E_{jk} = \exp(-(2\pi i / N) \cdot j \cdot k)$.
Step 3. Independent $M_0$-size FFTs on columns of $a_{jk}$.

The result of Step 1 - Step 3 is the $N = M_0 \times M_1$ 1D Fourier transform of $A$ stored
in rows: $(F(0), \ldots, F(M_1 - 1))$ in the first row, $(F(M_1), \ldots, F(2M_1 - 1))$ in the second row,
and so on. To regain the proper ordering of $A$ (how elements were originally stored in
matrix $a_{jk}$) requires a transposition of the matrix, which is the last step in a parallel FFT
algorithm.
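The three steps above can be checked numerically against a direct transform. The sketch below is illustrative only: the function names are hypothetical, and a naive $O(n^2)$ transform stands in for the vendor-optimized sequential FFT routine that would be called on each row and column.

```python
import cmath
import math

def dft(x, sign=-1):
    """Naive O(n^2) discrete Fourier transform; stand-in for a sequential FFT call."""
    n = len(x)
    return [sum(x[m] * cmath.exp(sign * 2j * math.pi * m * k / n) for m in range(n))
            for k in range(n)]

def fft_by_rows_and_columns(A, M0, M1):
    """Return the M0 x M1 matrix whose entry [k1][k0] equals F(M1*k1 + k0)."""
    N = M0 * M1
    # a[j][k] = A(M0*k + j): consecutive elements of A fill the columns.
    a = [[A[M0 * k + j] for k in range(M1)] for j in range(M0)]
    # Step 1: independent M1-size transforms on the rows.
    a = [dft(row) for row in a]
    # Step 2: multiply a[j][k] by E_jk = exp(-(2 pi i / N) j k).
    a = [[a[j][k] * cmath.exp(-2j * math.pi * j * k / N) for k in range(M1)]
         for j in range(M0)]
    # Step 3: independent M0-size transforms on the columns.
    for k in range(M1):
        column = dft([a[j][k] for j in range(M0)])
        for j in range(M0):
            a[j][k] = column[j]
    return a

# Example with unequal factors, so that a row/column mix-up would be visible.
M0, M1 = 4, 8
A = [complex(math.sin(0.3 * l), math.cos(0.7 * l)) for l in range(M0 * M1)]
result = fft_by_rows_and_columns(A, M0, M1)
direct = dft(A)
```

Reading `result` row by row reproduces `direct` in natural order, which is the storage pattern described in the text: the transform comes out in rows although the input was stored in columns.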

In the SSF method, the transposition is not necessary, because we apply a linear operator
$L(k)$ and then take Step 1 - Step 3 in reverse order. One can define a transposed linear
operator array, whose entries are stored in the same order as the transformed data, and
multiply $a_{jk}$ by this transposed linear operator. Then Step 1 - Step 3 are performed in
reverse order with the conjugate of the exponential term in Step 2.



The complete SSF parallel algorithm consists of the following steps: 



Step 1. nonlinear step 
Step 2. row-FFT 
Step 3. multiply by E 
Step 4. column-FFT 

Step 5. linear step (transposed linear operator) 
Step 6. column-BFT 

Step 7. multiply by E* (the complex conjugate of E) 
Step 8. row-BFT 
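Steps 2 - 8 above together apply $F^{-1}$, a diagonal multiplier, and $F$ without ever transposing the matrix. The following sketch checks this against a direct computation on the 1D array. It is a hedged illustration: the function names, the naive transform used in place of a vendor FFT, and the placement of the $1/N$ normalization at the very end are assumptions of the example, not details taken from the paper.

```python
import cmath
import math

def dft(x, sign=-1):
    """Naive O(n^2) transform; sign=+1 gives the unnormalized backward transform."""
    n = len(x)
    return [sum(x[m] * cmath.exp(sign * 2j * math.pi * m * k / n) for m in range(n))
            for k in range(n)]

def transform_rows(a, sign):
    return [dft(row, sign) for row in a]

def transform_columns(a, sign):
    columns = [dft(list(col), sign) for col in zip(*a)]
    return [list(row) for row in zip(*columns)]

def linear_part(A, M0, M1, multiplier):
    """Steps 2-8: apply F^{-1} diag(multiplier) F using row and column transforms only."""
    N = M0 * M1
    E = [[cmath.exp(-2j * math.pi * j * k / N) for k in range(M1)] for j in range(M0)]
    a = [[A[M0 * k + j] for k in range(M1)] for j in range(M0)]
    a = transform_rows(a, -1)                                            # Step 2
    a = [[a[j][k] * E[j][k] for k in range(M1)] for j in range(M0)]      # Step 3
    a = transform_columns(a, -1)                                         # Step 4
    # Step 5: entry [k1][k0] holds mode M1*k1 + k0, so the operator is read in
    # that ("transposed") order instead of moving the data.
    a = [[a[k1][k0] * multiplier[M1 * k1 + k0] for k0 in range(M1)]
         for k1 in range(M0)]
    a = transform_columns(a, +1)                                         # Step 6
    a = [[a[j][k] * E[j][k].conjugate() for k in range(M1)]
         for j in range(M0)]                                             # Step 7
    a = transform_rows(a, +1)                                            # Step 8
    # Read the columns back into a 1D array and normalize by 1/N.
    return [a[l % M0][l // M0] / N for l in range(N)]

def direct_linear_part(A, multiplier):
    """Reference: full-length forward transform, multiply, backward transform."""
    N = len(A)
    spectrum = [f * m for f, m in zip(dft(A, -1), multiplier)]
    return [v / N for v in dft(spectrum, +1)]
```

For example, with `multiplier[m]` set to a phase factor such as `cmath.exp(-1j * 0.01 * m * m)` (a hypothetical stand-in for $\exp(-ik^2\tau)$), `linear_part` and `direct_linear_part` agree to rounding error.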



The parallelization is due to the natural independence of operations in Steps 1, 3,
5, and 7 and to the row and column subarray FFTs in Steps 2, 4, 6, and 8. The row
and column subarray FFTs of size $M_1$ and $M_0$ are performed independently with serial
optimized 1D FFT routines. Working with subarray data, many processors can be used to
divide up the computational work, resulting in significant speedup if communication between
processors is efficient. Further, smaller subarrays allow for better data locality in the
primary and secondary caches. The implementation details of the shared memory and the
distributed memory parallel SSF algorithm outlined above depend on writing Steps 1
- 8 using either shared memory directives or distributed memory communication library
calls (MPI).



3.2 Shared Memory Approach 

Much of the SSF parallel algorithm outlined above can be implemented with "$doacross"
directives to distribute independent loop iterations over many processors. The FFTs of
size $M_0$ and $M_1$ are implemented by distributing the 1D subarray FFTs of rows and
columns over the P available processors. The performance can be improved drastically
by keeping the same rows and columns local in a processor's secondary cache to alleviate
true sharing of data from dynamic assignments of subarray FFTs by the "$doacross"
directive. The subarray FFTs are performed using vendor-optimized sequential 1D FFT
routines which are designed specifically for the architecture.

It is efficient to perform all column operations (Steps 3 - 7) in one pass: copy a
column into a local subarray S contained in the processor's cache and, in order, multiply
by the exponents in Step 3, perform the $M_0$-size FFT of S, multiply by the transposed
linear operator $\exp(\tau L)$, invert the $M_0$-size FFT, multiply by the conjugate exponents,
and finally store S back into the same column of a. This allows for efficient use of the
cache, reducing false/true sharing as we perform many operations on each subarray.
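The single pass over each column can be sketched as follows. This is an illustration only: Python's `concurrent.futures.ThreadPoolExecutor` is used as a stand-in for the "$doacross" loop directive to show that the column iterations are independent (in CPython it does not give a real speedup for this kind of arithmetic), the naive transform replaces the vendor FFT, and the names `column_pass` and `all_columns` as well as the $1/M_0$ normalization inside the pass are assumptions of the example.

```python
import cmath
import math
from concurrent.futures import ThreadPoolExecutor

def dft(x, sign=-1):
    """Naive O(n^2) transform; sign=+1 gives the unnormalized backward transform."""
    n = len(x)
    return [sum(x[m] * cmath.exp(sign * 2j * math.pi * m * k / n) for m in range(n))
            for k in range(n)]

def column_pass(S, E_col, L_col):
    """Steps 3-7 on one column held in the local subarray S."""
    M0 = len(S)
    S = [s * e for s, e in zip(S, E_col)]        # Step 3: multiply by the exponents
    S = dft(S, -1)                               # Step 4: M0-size forward transform
    S = [s * w for s, w in zip(S, L_col)]        # Step 5: transposed linear operator
    S = dft(S, +1)                               # Step 6: M0-size backward transform
    # Step 7: conjugate exponents, with the 1/M0 factor of the backward transform.
    return [s * e.conjugate() / M0 for s, e in zip(S, E_col)]

def all_columns(a, E, L, workers=4):
    """Apply column_pass to every column of a; iterations are independent."""
    M0, M1 = len(a), len(a[0])

    def work(k):
        return column_pass([a[j][k] for j in range(M0)],
                           [E[j][k] for j in range(M0)],
                           [L[j][k] for j in range(M0)])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        columns = list(pool.map(work, range(M1)))
    # Store each processed column back into its own place.
    return [[columns[k][j] for k in range(M1)] for j in range(M0)]
```

A simple check of the pass is to set every entry of `L` to one: the exponents and their conjugates cancel, the two transforms invert each other, and each column is returned unchanged.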

3.3 Distributed Memory Approach 

The Message Passing Interface (MPI) is a tool for distributed parallel computing which
has become a standard used on systems ranging from high-end parallel computers to weakly
coupled distributed networks of workstations (NOW) p[. In distributed parallel programming,
different processors work on completely independent data and explicitly use send and
receive library calls to communicate data between processors.

To implement the distributed parallel SSF algorithm for the Nonlinear Schrödinger
system (NLS), one needs to distribute the rows of array A among all P available processors.
Then Steps 1 - 3 can be executed without communication between processors. After
these steps, it is necessary to endure the communication cost of redistributing the elements
of A among the P processors. Each processor must send a fraction of its data to each
of the other processors. Then each processor will have the correct data for Steps 4 - 7,
and column operations are performed independently on all P processors. Finally, there is
a second redistribution prior to Step 8. To make T steps of the SSF algorithm, we use
the following scheme:

subroutine Distributed SSF
    distribute rows among processors
    Step 1. nonlinear step
    Step 2. row-FFT
    Step 3. multiply by a factor E
    for i = 1 to T - 1 do
        data redistribution
        Step 4. column-FFT
        Step 5. linear step
        Step 6. column-BFT
        Step 7. multiply by a factor E*
        data redistribution
        Step 8. row-BFT
        Step 1. nonlinear step
        Step 2. row-FFT
        Step 3. multiply by a factor E
    end do
    data redistribution
    Step 4. column-FFT
    Step 5. linear step
    Step 6. column-BFT
    Step 7. multiply by a factor E*
    data redistribution
    Step 8. row-BFT
end

The large performance cost in this algorithm is the redistribution of data between row 
and column operations. If the row and column computational stages result in significant 
speedup compared to the communication expense of redistributing the matrix data, then 
this algorithm will be successful. This depends crucially on fast communication between 
processors which is usually the case for shared memory multiprocessors and less so for 
NOW computers. 
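The data redistribution between row and column stages can be pictured without any message-passing library. The sketch below simulates it with plain Python lists: each "processor" is a list entry, and a "message" is a tile of the matrix. It assumes that P divides both $M_0$ and $M_1$; the function name `redistribute` and the tile bookkeeping are hypothetical and do not correspond to actual MPI calls.

```python
def redistribute(row_blocks, P):
    """Turn a row-distributed M0 x M1 matrix into a column-distributed one.

    row_blocks[p] holds rows p*M0/P ... (p+1)*M0/P - 1 on processor p.
    The result col_blocks[q] holds columns q*M1/P ... (q+1)*M1/P - 1 on
    processor q, stored as M0 rows of width M1/P.
    """
    M1 = len(row_blocks[0][0])
    width = M1 // P
    # messages[p][q] is the tile that processor p sends to processor q.
    messages = [[[row[q * width:(q + 1) * width] for row in row_blocks[p]]
                 for q in range(P)]
                for p in range(P)]
    # Processor q stacks the tiles received from p = 0, ..., P-1 in order.
    return [[row for p in range(P) for row in messages[p][q]] for q in range(P)]

# Example: a 4 x 8 matrix with entries 10*j + k, split over P = 2 processors.
M0, M1, P = 4, 8, 2
a = [[10 * j + k for k in range(M1)] for j in range(M0)]
row_blocks = [a[p * (M0 // P):(p + 1) * (M0 // P)] for p in range(P)]
col_blocks = redistribute(row_blocks, P)
```

Each tile contains $(M_0/P)(M_1/P) = N/P^2$ elements, so the volume exchanged between any single pair of processors shrinks quadratically as P grows.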

4 Results 

We performed timings of the parallel SSF algorithm on the Silicon Graphics/Cray Research
Origin 200 multiprocessor. The Origin 200 was used because it allows for both
shared and distributed memory parallel programming and models a generic multiprocessor.
The Origin 200 is efficient at fine-grained parallelism, which typically makes shared
memory programming both efficient and easy. The Origin 200 workstation used in this
study consisted of four MIPS R10000 64-bit processors (chip revision 2.6) with MIPS
R10010 (revision 0.0) floating point units running at 180 MHz. The primary cache consisted
of a 32 KB 2-way set-associative instruction cache and a 32 KB 2-way set-associative
data cache. Each processor also had a 1 MB 2-way set-associative secondary cache. The
machine had a sustained 1.26 GB/sec memory bandwidth and 256 MB of RAM.

The operating system was IRIX 6.4. We used a Mongoose f77 version 7.10 Fortran
compiler. For the parallel programming we used MPI version 1.1 for the distributed
computing and the native "$doacross" and synchronization directives provided by the f77
compiler for shared memory programming. All timings are of the total wall-clock time
for the code to both initialize and execute.

4.1 Timings 

For the following timings, we use $M_0 = M_1 = 2^{K}$, so that the entire 1D array is of size
$N = 2^{2K}$. The one-processor implementation of parallel SSF was 10% to 20% faster
than serial SSF code using vendor-optimized 1D FFTs of the entire array of size
$N = 2^{2K}$. This improvement is due to better cache locality using smaller subarrays, as an
entire subarray can be contained in the L1 cache, and to the fact that the single-processor
parallel SSF does not do the transposition stage of the 1D FFT. All timings
are compared to the one-processor parallel code at the same optimization level (compared
to sequential SSF, the speedups below are even more impressive). For shared memory
parallel implementations, we find over the range $2^{12} \le N \le 2^{18}$ that two-node SSF
implementations have good speedup (SU), with a maximum speedup at $N = 2^{16}$. Using
four nodes, for small array sizes we have one quarter of the work per processor, but more
contention due to the sharing of pieces of data contained in the secondary caches of four
different processors. At $N = 2^{16}$, we again see the maximum speedup (now for 4 nodes),
reflecting that the ratio of computational speed gain to communication contention is optimal
at this problem size.



Shared Memory

array size (N)           2^12     2^14     2^16     2^18
number of steps (T)      8000     2000      500      125
T_1pr (sec)              49.5     51.5     65.5     97.5
T_2pr (sec)              29.5     30.5     33.5     61.0
T_4pr (sec)              19.5     18.5     19.5     34.5
SU = T_1pr / T_2pr        1.7      1.7      2.0      1.6
SU = T_1pr / T_4pr        2.5      2.8      3.4      2.8



Under the shared memory programming model, subarrays are continually distributed
among processors to divide up the computational work. Data in a single subarray may be
contained on one or more processors, requiring constant communication. The data contained
in each processor's L2 cache is of size $O(N/P)$, where $P$ is the number of processors.
Contention in the memory system is modeled as being proportional to $O((N/P)^2)$, which
reflects the communication congestion for sharing data of large working sets. Further,
unlike the serial code, the parallel code endures a communication time to send data between
processors proportional to $O(N/P)\,\tau_c$, where $\tau_c$ is the time to transfer an L2 cache
block between processors. Finally, the time to perform the 1D FFT is approximately
$N \log(N)\, \tau_M$, where $\tau_M$ is the time to perform a floating point operation. A simple
formula for the speedup (SU) of the shared memory FFT is

    $SU = \frac{\tau_M N \log(N)}{\tau_M N \log(\sqrt{N})/P + \tau_c N/P + f (N/P)^2}$,

where $f$ is a small number reflecting contention in the communication system. If $N = 2^{2K}$
we can simplify the above expression,

    $SU = \frac{2P}{1 + C/K + f\, 2^{2K}/(PK)}$,

where the constants are absorbed into $f$ and $C$. With $f = 0$ (no contention) one predicts for
fixed $P$ that the speedup increases for larger and larger problem size $N$. However, for $f \neq 0$
the speedup eventually decreases with larger $N$ due to contention of communicating small
pieces of subarray data between arbitrary processors. This equation reflects the trend
seen in our empirical data of speedup for shared memory SSF, where speedup attains a
maximum with problem size at $N = 2^{16}$.
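The qualitative behavior of the simplified speedup formula is easy to explore numerically. In the sketch below the values of C and f are hypothetical and chosen only to make the trend visible; they are not fitted to the measured timings, and the function name `speedup` is likewise an assumption of the example.

```python
def speedup(P, K, C, f):
    """Model speedup for N = 2**(2*K) points on P processors."""
    return 2.0 * P / (1.0 + C / K + f * 2.0 ** (2 * K) / (P * K))

# Hypothetical constants, for illustration only.
C, f = 2.0, 1.0e-4
sizes = range(6, 10)                      # K = 6..9, i.e. N = 2^12 .. 2^18
no_contention = [speedup(4, K, C, 0.0) for K in sizes]
with_contention = [speedup(4, K, C, f) for K in sizes]
```

With f set to zero the model values grow monotonically with K, while any positive f eventually makes the $2^{2K}/(PK)$ term dominate, so the speedup passes through a maximum at an intermediate problem size.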

The above SU formula must be reinterpreted for distributed SSF due to the implicit
independent computational stages where no data is communicated between processors, unlike
shared memory SSF. Distributed SSF uses communication stages to send data between
processors and does not involve contention due to sharing data between P processors
during computation stages. Distributed MPI timings are compared to a one-processor
MPI code at the same optimization level. The MPI one-processor code was faster than
the one-processor shared memory code, as it did not have synchronization steps. The
parallel timings were typically faster than the shared memory parallel code, except for the
$N = 2^{16}$ array size, for which the shared memory code did slightly better. We find that
for distributed memory parallel implementations of SSF over the range $2^{12} \le N \le 2^{18}$,
two-node implementations have good speedup with maximum speedup at $N = 2^{14}$,
beyond which the communication cost increases and the computation/communication ratio
decreases for larger problem size. The communication cost is different in the MPI case
than for shared memory, as data is communicated in "communication stages", so less than
perfect speedup (SU) is due to the volume of data communicated between processors
in redistribution stages. Using four nodes, we find that the speedup increases with the
working set N. This is due both to making the computation stages faster, $O(N \log(N)/8)$,
and to communicating only $O(N/16)$ of data between single processors in the
redistribution stage. For small problem size there is not enough work to make dividing the
problem among 4 processors beneficial. The speedup in the distributed SSF algorithm
is attributed to the independence of data contained in a processor's local cache between
data rearrangement stages, which is not true for the dynamic assignment and sharing of
subarray data throughout computational stages in shared memory SSF implementations.

Distributed Memory (MPI)

array size (N)           2^12     2^14     2^16     2^18
number of steps (S)      8000     2000      500      125
T_1pr (sec)              37.9     44.5     59.4     92.4
T_2pr (sec)              24.7     25.4     34.9     65.9
T_4pr (sec)              18.8     16.3     20.1     26.8
SU = T_1pr / T_2pr        1.5      1.8      1.7      1.4
SU = T_1pr / T_4pr        2.0      2.7      3.0      3.4
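The speedup rows of the two tables follow from the tabulated wall-clock times as the ratio of the one-processor time to the P-processor time. A short sketch that recomputes them (the dictionary layout and the function name `speedups` are just conveniences for the example):

```python
# Wall-clock times in seconds for N = 2^12, 2^14, 2^16, 2^18, keyed by processor count.
timings = {
    "shared": {1: [49.5, 51.5, 65.5, 97.5],
               2: [29.5, 30.5, 33.5, 61.0],
               4: [19.5, 18.5, 19.5, 34.5]},
    "mpi":    {1: [37.9, 44.5, 59.4, 92.4],
               2: [24.7, 25.4, 34.9, 65.9],
               4: [18.8, 16.3, 20.1, 26.8]},
}

def speedups(table, P):
    """Speedup on P processors relative to the one-processor run, to one decimal."""
    return [round(t1 / tp, 1) for t1, tp in zip(table[1], table[P])]

for name, table in timings.items():
    for P in (2, 4):
        print(name, P, speedups(table, P))
```

Dividing each speedup by P gives the parallel efficiency, which makes the two programming models easier to compare at a given array size.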
These results are encouraging in that the speedup in multiprocessor SSF implementations
is considerable. Speedup over sequential code using a vendor-optimized full-array 1D
FFT is even greater. We recommend implementing the parallel SSF algorithm even on
sequential machines due to the 10% to 20% speedup over optimized 1D sequential SSF
algorithms. This reflects a better use of the L1 cache and data locality by using small
subarrays and removing the transposition stage of the 1D FFT in SSF. For shared memory
implementations of parallel SSF, the maximum speedup requires balancing the contention
from communicating data held on more than one processor against the computational
performance gain of using small subarrays. For the distributed parallel SSF there is more
data locality, as data is distributed statically prior to the computational stages. This division
into computation and communication stages differs from shared memory SSF,
which dynamically distributes subarray FFTs and shares data on more than one processor.
Distributed SSF speedup is a function of the number of processors P, which reduces
both the computational time and the communication volume between single processors. The
speedup of the parallel SSF is strongly dependent on reducing communication time and
contention in the multiprocessor.

5 Conclusions 

Multiprocessor systems occupy the middle ground in computing between sequential and
massively parallel computation. In multiprocessor computing, one wants to write code to
take advantage of between 2 and 16 processors to get good speedups over sequential code.
Our parallel SSF method is designed for small numbers of tightly integrated processors to
divide the 1D FFT into many subarray FFTs performed on P processors. The speedup
depends on optimizing the ratio of computational speed gain to communication cost in order
to speed up traditionally sequential numerical code. The shared memory parallel SSF
algorithm does not scale with problem size, as subarray data is distributed over more
than one processor, causing increases in contention due to gathering large amounts of
subarray data from many processors. The distributed memory parallel SSF algorithm uses
independent data during computational stages and then uses expensive data redistribution
stages. The communication cost of the data redistribution stages can be reduced by using
more processors, which also decreases the time for the computation stage. Our results
suggest that nearly perfect speedup can be achieved over sequential SSF algorithms by
tuning the number of processors and problem size. The significant speedup over sequential
code is broadly applicable to many sorts of code which depend crucially on speeding up
the sequential 1D FFT and should be explored for other numerical algorithms.